On March 17, 2020, President Trump referred to the coronavirus as the “China Virus.” In the weeks that followed, there was an uptick in verbal and physical attacks against Asian-Americans (Olson, 2020). One aspect of public health that often falls by the wayside is how influential public officials and leaders are in disseminating public health information. Their words can change not only the public’s views on a health matter but also a nation’s perspective on a group’s identity. The Asian Pacific Policy and Planning Council reported that in its first four weeks of collecting reports on COVID-19 anti-Asian discrimination, it received more than 1,500 incident reports (Jeung & Nham, 2020). This abrupt spasm of racism mirrors the kind faced by American Muslims, Arabs and South Asians in the United States after the terrorist attacks of 9/11 (Tavernise & Oppel Jr, 2020). However, whereas President George W. Bush urged tolerance of American Muslims, this time President Trump is using language that Asian-Americans say is inciting racist attacks.
In addition, given the influence of identity politics, we may expect the term “China Virus” to be more polarizing for certain identities and states. The Pew Research Center found that nearly three-quarters of Republicans and Republican-leaning independents view China unfavorably (Devlin et al., 2020). But the majority of reports to the Asian Pacific Policy and Planning Council came from Democratic-leaning states like California, Washington and Illinois (Jeung & Nham, 2020). Moreover, states with a higher number of reported positive cases tend to show higher interest in Googling the term “China Virus” (Google Trends, 2020). This raises the question: What political, demographic and COVID-19 factors best predict the level of interest a state will have in the term “China Virus”?
To answer this question, we used Markov chain Monte Carlo (MCMC) simulation to build four predictive models that included some of the variables mentioned above. Model 1, a repeated measures model, is our simplest model; Model 2 is a normal regression model; Model 3 combines normal regression and repeated measures; and Model 4, our most complex model, is longitudinal. We then ran several diagnostics, such as MCMC diagnostics and posterior predictive (PP) checks, to evaluate the quality and structure of our models. While all the models had some level of predictive strength, our best model was Model 3, the combined normal regression and repeated measures model. This model provided the best balance between complexity and quality. Model 3 captures important attributes that the normal regression and repeated measures models alone did not, and it achieves similar results to our most complex model, Model 4, without the heavy computational toll. Overall, we hope this project inspires others to explore the interconnectedness of health, communication, politics and demography.
| Variables: | Description: |
|---|---|
| | State abbreviation and main identifier |
| StateColor | Political leaning of the state: red, blue or purple |
| percent_white | Percent of the population that is White |
| percent_asian | Percent of the population that is Asian |
| positive | Number of reported positive COVID-19 cases |
| | Number of reported negative COVID-19 cases |
| | Date of report |
| | Total number of test results (positive + negative) |
| ChinaVirusInterest | Interest index from Google searches by state. Peak search day = 100; all other days in the set are scaled relative to this peak day. |
| | States divided into five regions: West, South, Mountain, Northeast, Midwest |
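To make the scaling of the interest index concrete, here is a minimal sketch in base R. It assumes the index is a simple rescaling of daily search volume so that the peak day equals 100; the raw counts and the helper name `to_interest_index` are made up for illustration, and Google's actual normalization may involve further steps.

```r
# Hypothetical raw daily search counts for one state (made-up numbers).
raw_searches <- c(12, 30, 45, 60, 48, 24, 6, 0)

# Rescale so the peak day equals 100 and every other day is relative to it.
# Assumption: simple peak scaling, as described in the variable table.
to_interest_index <- function(x) {
  stopifnot(is.numeric(x), max(x) > 0)
  round(100 * x / max(x))
}

interest <- to_interest_index(raw_searches)
print(interest)
```

Under this scaling a value of 0 means no recorded searches on that day, which is why a state can show several zero days in the window.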
The purpose of the visualizations below is to deepen our understanding of how our variables of interest interact with one another and how they are structured. We chose to investigate the variables listed above because our background research showed that each of these factors, on its own, has some kind of influence on our outcome variable, interest in the term “China Virus.” With our new combined datasets, we hope to see that our intuition, background research and the data align with one another to tell an interesting narrative.
This visualization aims to establish what our outcome variable looks like across the United States. It also provides information on the distribution and geographical patterns of several variables of interest (hover for additional information). As we can see, Google interest in the term “China Virus” is highly variable across states. The states on the higher end of the scale are Louisiana, Florida, Oklahoma, Michigan, Arizona, Nebraska, and Alaska. Because these states are spread across several regions, predicting interest in “China Virus” by region would most likely be ineffective. Additionally, the majority of these high-interest states have a large number of reported positive cases. This is an interesting observation given that reporting of positive cases differs across states and may influence the general public differently in each state. Moreover, the majority of the high-interest states are designated “red,” with the exception of Michigan, which is “blue.” This indicates that the variable State Color may be one of the most effective predictors for our models. This visual includes information on white and Asian percentages by state, but Figure 2 provides a more informative depiction of their geographic distribution.
Including percent_asian in our model will most likely be ineffective; thus, it may be more informative to include percent_white as an indicator of diversity. While this sounds counterintuitive, the idea is that a lower percent_white in a state means a higher percentage of non-white-identifying individuals. Thus, this variable may prove influential in our models.
During the week of 2020-03-14 to 2020-03-21, Trump labeled the coronavirus the “China Virus” in an official press announcement, and we wanted to see how his comments affected search patterns across states.
This plot shows “China Virus” search interest over our time period of interest, grouped by State Color. For each State Color, we can see an overall increase in interest in the term “China Virus,” especially among the Democratic and swing states. While this seems contradictory to our intuition, the pattern can be explained by the complicated nature of US politics. We dug deeper into why this pattern would occur and found that Democrats tend to take more initiative than Republicans in researching terms they have heard on the news, especially from the White House COVID-19 briefings. So while interest in the term “China Virus” is higher in blue states, the intent behind the searches may differ from that in red states. This finding is very important to note because when we include State Color in our models, we have to be wary of how we interpret the coefficients associated with red, blue and purple.
Figure 4 aims to address the question: What does the distribution of interest in the term “China Virus” look like, and how much does it vary? As we can see, the majority of states look roughly normal with a small bump at 0. One could argue against a normal distribution, as some states like Delaware, Texas, Washington and Alabama look a bit right-skewed, but overall a normal distribution is the best description of these densities as a whole. Additionally, we can see that Google interest in the term “China Virus” varies over quite a large range between states. Very few states have high densities in the upper part of the interest scale, but there are some interesting peaks among the lower values. For example, Alaska, Wyoming and Iowa have unusual peaks around the 25-50 range. It is also interesting to note that there isn’t an obvious mean or median value of “China Virus” interest shared among the states. Again, this prompts more questions about the characteristics of the states themselves. While this shows us that a normal regression model may suffice, we still explore the possibility of varying slopes among states below.
Figure 5 helps inform us about what level of complexity our model may need. We originally visualized the slopes of all 50 states, but for the purposes of this report it is more effective to observe a select few. As we can see, states like Iowa and Missouri have a negative slope, which indicates that their interest in the term “China Virus” decreases over time. However, states like Louisiana, Maryland, Michigan and Mississippi have upward slopes, which indicate the opposite. Then there are states with a relatively flat slope, which means their interest neither increases nor decreases over time. Thus, these plots tell us that we will want to explore a longitudinal model because states do seem to have varying rates of change in interest; however, it may not be entirely necessary because the slopes do not seem to vary drastically in steepness.
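The idea of comparing state-level slopes can be sketched in base R by fitting a separate least-squares line per state and looking at how much the slopes differ. The three states and their daily values below are invented for illustration and are not taken from our dataset; `slope_of` is a hypothetical helper.

```r
# Hypothetical daily interest values for three made-up states over 8 days.
days <- 1:8
interest_by_state <- list(
  state_a = c(60, 55, 52, 48, 45, 40, 38, 33),  # falling interest
  state_b = c(20, 26, 31, 35, 42, 47, 51, 58),  # rising interest
  state_c = c(40, 42, 39, 41, 40, 38, 41, 40)   # roughly flat
)

# Least-squares slope of interest on day for one state.
slope_of <- function(y, x) unname(coef(lm(y ~ x))[2])

slopes <- sapply(interest_by_state, slope_of, x = days)
print(round(slopes, 2))

# The spread of the slopes hints at whether state-specific slopes are needed.
cat("sd of slopes:", round(sd(slopes), 2), "\n")
```

If the slopes were nearly identical across states, a model with a single shared slope would likely be enough.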
For our simplest model we decided to use a repeated measures model. Our team decided that the repeated measures structure was a necessary component because of how our data is set up: each state has a value of ChinaVirusInterest for each day in our target period (2020-03-14 to 2020-03-21). Given these repeated measures and our prior understanding of the varying characteristics (demographic, political, COVID-19 impact) of different states, the repeated measures model allows us to capture these differences in ChinaVirusInterest with the \(\theta_i\) value, which represents each state’s mean value.
\[\begin{aligned} Y_{ij}|\theta_i, \mu, \sigma_w, \sigma_b \sim N(\theta_i,\sigma_w^2)\\ \theta_i|\mu,\sigma_b \overset{ind}{\sim} N(\mu, \sigma_b^2)\\ \sigma_b,\sigma_w \sim Exp(...) \end{aligned}\]
\(Y_{ij} =\) ChinaVirusInterest per:
\(i=State\)
and \(j= Day\)
\(\theta_i =\) State i’s unique mean value
\(\sigma_w =\) within state variation
\(\sigma_b =\) between state variation
We decided that our model should assume a normal distribution for the response because ChinaVirusInterest’s distribution is fairly normal. Although the tails are not as flat as we would want them to be, we decided that a normal distribution was the best choice for modeling ChinaVirusInterest.
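To illustrate the structure of Model 1, the following base R sketch simulates data from the same two-level hierarchy and recovers rough moment-based estimates of the within-state and between-state deviations. All numeric values (`mu`, `sigma_b`, `sigma_w`, 50 states, 8 days) are assumptions chosen for illustration, not estimates from our data, and the simulated values are not bounded to 0-100 the way the real index is.

```r
set.seed(1)

# Assumed values for illustration only.
n_states <- 50
n_days <- 8
mu <- 45        # overall mean interest
sigma_b <- 9    # between-state standard deviation
sigma_w <- 12   # within-state standard deviation

# Layer 2: each state's mean theta_i is drawn around mu.
theta <- rnorm(n_states, mean = mu, sd = sigma_b)

# Layer 1: daily observations Y_ij vary around their state's mean.
sim <- data.frame(
  state = rep(seq_len(n_states), each = n_days),
  day = rep(seq_len(n_days), times = n_states)
)
sim$y <- rnorm(nrow(sim), mean = theta[sim$state], sd = sigma_w)

# Moment-based estimates: Var(state mean) = sigma_b^2 + sigma_w^2 / n_days.
state_means <- as.numeric(tapply(sim$y, sim$state, mean))
within_var <- mean(as.numeric(tapply(sim$y, sim$state, var)))
between_var <- max(var(state_means) - within_var / n_days, 0)

cat("within sd estimate:", round(sqrt(within_var), 1), "\n")
cat("between sd estimate:", round(sqrt(between_var), 1), "\n")
```

This is a frequentist shortcut used only to show what \(\sigma_w\) and \(\sigma_b\) measure; it is not the MCMC fit used in the report.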
For our second model, we wanted to understand what made some states show more interest in the term than others. As described in our research motivation, we wanted to explore what explained the differences in interest in the “China Virus” term. Was it a political difference, a demographic difference, or a difference related to COVID-19 impact? To do this, we used a simple normal regression model with the following specifications: percent_white for demographics, StateColor for politics, and positive (the number of positive cases) for COVID-19 impact.
\[\begin{aligned} Y_{ij}|\beta_0, \beta_1, \beta_2,\beta_3,\beta_4,\sigma &\overset{ind}{\sim} N(\beta_0+ \beta_1X_{ij} + \beta_2X_{2i} + \beta_3X_{3i}+ \beta_4X_{4i},\sigma^2)\\ \beta_0,\beta_1, \beta_2,\beta_3,\beta_4 &\sim N(...,...)\\ \sigma &\sim Exp(...)\\ \end{aligned}\]
\(Y = ChinaVirusInterest\\\)
\(i = state\\\)
\(j = Days\\\)
\(X_{ij} = \text{Days}\\\)
\(X_2 = \text{percent_white}\; X_3 =\text{StateColor}\; X_4 =\text{positive (number of positive cases)}\\\)
\(\sigma = \text{Standard deviation of }\) ChinaVirusInterest around the regression mean
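Because StateColor takes three values, in practice it enters a regression as indicator columns rather than as a single number \(X_3\). The base R sketch below shows that expansion on a made-up data frame; it assumes blue is the reference level, which is consistent with the coefficient names StateColorred and StateColorpurple that appear in this report.

```r
# Made-up rows for illustration; not taken from our dataset.
toy <- data.frame(
  StateColor = factor(c("blue", "red", "purple", "red", "blue"),
                      levels = c("blue", "red", "purple")),
  percent_white = c(60, 85, 75, 90, 55)
)

# model.matrix() expands the factor into 0/1 indicator columns.
# Blue is the reference level, so it gets no column of its own.
X <- model.matrix(~ percent_white + StateColor, data = toy)
print(colnames(X))
print(X)
```

Each StateColor coefficient is therefore read as a difference from blue states with the other predictors held fixed, which is one reason the red, blue and purple coefficients need careful interpretation.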
For our third model, we combined the first two models so that we could both understand variation in the political, demographic and COVID-related variables and capture variation that we could not explain in Model 2. We kept the correlation structure from Model 1 but included the variables from Model 2 so that we could explain why some states had a larger ChinaVirusInterest than others. As in Model 1, Model 3 captures differences in ChinaVirusInterest with the \(\theta_i\) value, which now acts as each state’s own intercept (its baseline mean value).
\[\begin{aligned} Y_{ij}|\theta_i,\mu,\beta_1, \beta_2,\beta_3,\beta_4,\sigma_w,\sigma_b &\sim N(\theta_i +\beta_1X_{ij} + \beta_2X_{2i} + \beta_3X_{3i}+ \beta_4X_{4i}, \sigma_w^2)\\ \theta_i|\mu, \sigma_b &\overset{ind}{\sim}N (\mu, \sigma_b^2)\\ \beta_1, \beta_2,\beta_3,\beta_4 &\sim N(...,...)\\ \sigma_w,\sigma_b &\sim Exp(...)\\ \end{aligned}\]
\(Y = ChinaVirusInterest\\\)
\(i = state\\\)
\(j = Days\\\)
\(X_{ij} = \text{Days}\\\)
\(X_2 = \text{percent_white}\; X_3 =\text{StateColor}\; X_4 =\text{positive (number of positive cases)}\\\)
\(\sigma_w = \text{within state variation} \\\)
\(\sigma_b = \text{between state variation}\\\)
Because of the structure of our data, we are also able to create a longitudinal model, as we have repeated measures of ChinaVirusInterest and corresponding observations for Day. In this more complex model we allow for state-specific slopes and intercepts to predict each state’s ChinaVirusInterest. As Day increases, we get closer to when Trump referred to the coronavirus as the “China Virus.” Thus, if we see an increasing slope, we might conclude that in that state Trump’s use of “China Virus” increased interest in the term, and vice versa.
\[ \begin{split} Y_{ij} | b_0, b_1, \beta_0, \beta_1,\beta_2,\beta_3,\beta_4 ,\sigma_w, \sigma_{0b}, \sigma_{1b} & \sim N( b_{0i} + b_{1i} X_{ij}+ \beta_2X_{2i} + \beta_3X_{3i} +\beta_4X_{4i}, \; \sigma_w^2) \\ b_{0i} | \beta_0, \sigma_{0b} & \stackrel{ind}{\sim} N(\beta_0, \sigma_{0b}^2) \\ b_{1i} | \beta_1, \sigma_{1b} & \stackrel{ind}{\sim} N(\beta_1, \sigma_{1b}^2) \\ \beta_0,\beta_1,\beta_2,\beta_3,\beta_4 & \sim N(..., ...) \\ \sigma_w & \sim Exp(...) \\ \sigma_{0b} & \sim Exp(...) \\ \sigma_{1b} & \sim Exp(...) \\ \end{split} \]
In this hierarchical model each state receives its own slope and intercept. The first layer consists of the \(\beta_{0}\) through \(\beta_{4}\) values, which represent the population-wide intercept, the population-wide slope, and the coefficients of the three predictors that we bring in from Model 3. \(\beta_{0}\) is the intercept shared across all states. From this value each state derives its individual \(b_{0i}\), that state’s own intercept. In the diagram below we see that the \(b_{0i}\) values are normally distributed around \(\beta_{0}\) with standard deviation \(\sigma_{0b}\). The same process determines the state-specific slopes, using \(\beta_{1}\), \(b_{1i}\) and \(\sigma_{1b}\).
The main purpose of this model is to find the magnitude of \(\sigma_{1b}\), since this deviation tells us whether there truly are state-specific slopes, and therefore whether we need this more flexible model to predict ChinaVirusInterest.
\(Y = ChinaVirusInterest\\\)
\(i = state\\\)
\(j = Days\\\)
\(X_{ij} = \text{Days}\\\)
\(X_2 = \text{percent_white}\; X_3 =\text{StateColor}\; X_4 =\text{positive (number of positive cases)}\\\)
\(\sigma_w = \text{within state variation} \\\)
\(\sigma_{0b} = \text{between state variation for intercept}\\\)
\(\sigma_{1b} = \text{between state variation for slopes}\\\)
\(b_{0i} = \text{State-Specific-Intercept}\)
\(\beta_0 = \text{Population-wide-Intercept}\)
\(b_{1i} = \text{State-Specific-Slope}\)
\(\beta_1 = \text{Population-wide-Slope}\)
# 4. Model Evaluation
In the output above we see that the within-state deviation is much narrower than the between-state deviation. This matches our intuition that a repeated measures model, which gives each state its own mean, will be able to explain a greater amount of the variation.
This correlation table shows us that there is relatively weak correlation among the daily observations within a state. Although the correlation is not strong, it is still appropriate for us to use a repeated measures model in order to account for the correlation within states.
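The within-state correlation follows directly from the repeated measures structure of Model 1. As a short derivation, write \(Y_{ij} = \theta_i + \varepsilon_{ij}\) with \(\varepsilon_{ij} \sim N(0, \sigma_w^2)\) independent of \(\theta_i\) (the symbol \(\varepsilon_{ij}\) is introduced here only for illustration). Two different days \(j \neq k\) in the same state share only \(\theta_i\), so, given \(\mu\), \(\sigma_b\) and \(\sigma_w\):

\[\begin{aligned} \operatorname{Cov}(Y_{ij}, Y_{ik}) &= \operatorname{Var}(\theta_i) = \sigma_b^2\\ \operatorname{Var}(Y_{ij}) &= \sigma_b^2 + \sigma_w^2\\ \operatorname{Cor}(Y_{ij}, Y_{ik}) &= \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2} \end{aligned}\]

A small \(\sigma_b\) relative to \(\sigma_w\) therefore gives a weak within-state correlation, and a large one gives a strong correlation.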
As we can see in this MCMC trace plot, our chains look fairly stable but could be a bit more so, as the peaks are spread around the 40-42.5 range. We also see that our values for \(\sigma_b^2\) are greater than 0, centered around 75, meaning that we should include random intercepts in our final model.
Overall, the posterior predictive check for our repeated measures model tells us that the structure of our model is fairly reasonable. In other words, the assumption of a normal model is fairly reasonable, apart from the fact that the tails are a bit thicker than we would want them to be, especially around 0. This is because we expect many states to have multiple days with a value of zero, which occurs when no one in the state looks up the term.
```r
library(bayesplot)
# NR_model is the fitted normal regression model object from earlier in the analysis.
nr_pars <- c("sigma", "(Intercept)", "percent_white", "StateColorred", "StateColorpurple", "positive", "Day")
facet_layout <- list(ncol = 3, strip.position = "left")
# Trace plots
mcmc_trace(NR_model, pars = nr_pars, facet_args = facet_layout)
# Density plots
mcmc_dens_overlay(NR_model, pars = nr_pars, facet_args = facet_layout)
```
Compared to the previous model, it looks like our chains are more stable and have closer peaks.
Although our posterior predictive check is quite similar to that of our previous model, the peaks are a bit closer together. As with the previous model, the structure of the GLM is fairly reasonable.
Compared to the previous model, our chains are more stable for the intercept and the percent_white variable, as the peaks are closer together. We also see that our \(\sigma_b^2\) is almost always above 0, and its density has a mean around 75, similar to our repeated measures model.
As in our earlier posterior predictive checks, this model shows similar errors around the tails, but the GLM structure also seems reasonable for this model.
In the density overlay we can see that our value for \(\sigma_{1b}^2\) is close to zero, but we could still argue that random slopes are necessary: the mean of \(\sigma_{1b}^2\) is around 2-2.5, and given that the slopes themselves do not span a very wide range, 2-2.5 is a meaningful amount of variation between slopes. As in our previous density overlay checks, our value of \(\sigma_{0b}^2\) is almost always above zero and its mean is around 75, which means that random intercepts are necessary.
Overall, all of the posterior predictive checks look similar because every model assumes a normally distributed response variable.
| | mae | mae_scaled | within_50 | within_95 |
|---|---|---|---|---|
| Model 1 | 9.362509 | 0.5418717 | 0.5980392 | 0.9681373 |
| Model 2 | 10.622916 | 0.5764897 | 0.5490196 | 0.9289216 |
| Model 3 | 8.940521 | 0.5224663 | 0.6029412 | 0.9632353 |
| Model 4 | 8.769638 | 0.5027308 | 0.6274510 | 0.9656863 |
As we can see in the table above, the lowest MAE was achieved by our most complex model, Model 4, which allows for both random slopes and random intercepts. But its MAE is not that different from the MAE achieved by Model 3, which is simpler and allows only for random intercepts. Interestingly, Model 1, our simplest model, had the highest proportion of the data falling within the 95% posterior predictive interval.
We believe that the decrease in MAE and the increase in the proportion of values observed within the 50% posterior predictive interval do not justify the added complexity. This is why we argue that Model 3 is the best model: it achieves similar results to Model 4 without being as computationally expensive. If we wanted a more precise model, however, we would use Model 4, and our mcmc_dens plots and visualizations did show different slopes across states, which makes Model 4 a reasonable choice in that case.
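For readers who want to see how summary metrics of this kind can be computed, here is a base R sketch that works from a matrix of posterior predictive draws. The report does not state the exact definitions used, so this assumes that mae is the median absolute difference between each observed value and its posterior predictive mean, that mae_scaled divides that difference by the posterior predictive standard deviation, and that within_50 and within_95 are the shares of observations falling inside central 50% and 95% predictive intervals. The data are simulated and `prediction_metrics` is a hypothetical helper.

```r
set.seed(2020)

# Simulated stand-ins: 40 observed values and 2000 predictive draws for each.
n_obs <- 40
n_draws <- 2000
y_obs <- rnorm(n_obs, mean = 45, sd = 15)
# Rows are draws, columns are observations.
y_rep <- sapply(y_obs, function(m) {
  rnorm(n_draws, mean = m + rnorm(1, 0, 5), sd = 12)
})

prediction_metrics <- function(y, draws) {
  stopifnot(ncol(draws) == length(y))
  post_mean <- colMeans(draws)
  post_sd <- apply(draws, 2, sd)
  abs_err <- abs(y - post_mean)
  # Share of observations inside the central predictive interval.
  inside <- function(level) {
    lo <- apply(draws, 2, quantile, probs = (1 - level) / 2)
    hi <- apply(draws, 2, quantile, probs = 1 - (1 - level) / 2)
    mean(y >= lo & y <= hi)
  }
  c(mae = median(abs_err),
    mae_scaled = median(abs_err / post_sd),
    within_50 = inside(0.50),
    within_95 = inside(0.95))
}

metrics <- prediction_metrics(y_obs, y_rep)
print(round(metrics, 3))
```

A well-calibrated model should have within_50 near 0.5 and within_95 near 0.95, which is the benchmark against which the four models' values can be read.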
| Statistic | y_new (1st output) | y_new (2nd output) | y_new (3rd output) |
|---|---|---|---|
| Min. | -15.69 | -28.83 | -25.66 |
| 1st Qu. | 33.70 | 34.48 | 38.37 |
| Median | 45.57 | 46.94 | 50.16 |
| Mean | 45.36 | 46.94 | 50.06 |
| 3rd Qu. | 57.01 | 59.49 | 61.82 |
| Max. | 106.62 | 117.21 | 121.14 |

| mean |
|---|
| 47.6 |
| Models | Mean | ACTUAL | Min | 1st Qu. | Median | 3rd Qu. | Max |
|---|---|---|---|---|---|---|---|
| 1 | 45.3 | 47.6 | -16.0 | 33.8 | 45.4 | 56.9 | 108 |
| 2 | 47.0 | 47.6 | -28.8 | 34.5 | 46.9 | 59.5 | 117 |
| 3 | 50.0 | 47.6 | -22.2 | 38.3 | 50.1 | 61.9 | 128 |
| 4 | 53.2 | 47.6 | -22.2 | 41.2 | 53.3 | 65.1 | 123 |
This table compares each model’s prediction for Florida on the final day of our target window with the actual value in the dataset. We find that the predictions from our more complex models, Models 3 and 4, are not particularly close, with Model 4 having an error of close to 6. However, we are aware that the MAEs are smaller for Models 3 and 4.
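The size of each model's miss for Florida can be read off with a line of arithmetic. The sketch below uses the Mean and ACTUAL values reported in the table; the data frame and column names are introduced here only for illustration.

```r
# Mean prediction per model and the actual value, as reported in the table.
florida <- data.frame(
  model = 1:4,
  predicted_mean = c(45.3, 47.0, 50.0, 53.2),
  actual = 47.6
)

# Absolute error of each model's mean prediction.
florida$abs_error <- abs(florida$predicted_mean - florida$actual)
print(florida)
```

The absolute errors are 2.3, 0.6, 2.4 and 5.6 for Models 1 through 4, so for this single Florida observation Model 2 happens to be closest and Model 4 furthest. One observation is not enough to rank the models.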
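One possible improvement is a truncated normal model bounded to the 0-100 range of the interest index, since the current models produce very wide predictions, including values below 0 and above 100.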
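As a rough illustration of what a truncated normal on 0-100 looks like, the base R sketch below draws from one using the inverse-CDF method. It shows only the bounded distribution, not a fitted model; the mean and standard deviation are illustrative assumptions, and `r_truncnorm` is a hypothetical helper, not a function used in this report.

```r
set.seed(42)

# Draw from a normal distribution truncated to [lower, upper].
# Uniform draws are restricted to the CDF range of the bounds,
# then mapped back through the normal quantile function.
r_truncnorm <- function(n, mean, sd, lower = 0, upper = 100) {
  p_lo <- pnorm(lower, mean, sd)
  p_hi <- pnorm(upper, mean, sd)
  qnorm(runif(n, p_lo, p_hi), mean, sd)
}

# Assumed values for illustration only.
draws <- r_truncnorm(5000, mean = 47, sd = 20)
print(range(draws))
```

Unlike the untruncated predictions summarized above, every draw stays inside the range the interest index can actually take.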
Devlin, K., Silver, L., & Huang, C. (2020). Amid Coronavirus Outbreak, Americans’ Views of China Increasingly Negative. Retrieved 5 May 2020, from https://www.pewresearch.org/global/2020/04/21/u-s-views-of-china-increasingly-negative-amid-coronavirus-outbreak/
Google Trends. (2020). Retrieved 5 May 2020, from https://trends.google.com/trends/explore?geo=US&q=china%20virus
Jeung, R., & Nham, K. (2020). STOP AAPI HATE MONTHLY REPORT 4/23/20. Retrieved 5 May 2020, from http://www.asianpacificpolicyandplanningcouncil.org/wp-content/uploads/STOP_AAPI_HATE_MONTHLY_REPORT_4_23_20.pdf
Olson, H. (2020). Trump’s not the only one blaming China. Americans increasingly are, too. Retrieved 5 May 2020, from https://www.washingtonpost.com/opinions/2020/05/04/get-ready-an-election-all-about-china/
Tavernise, S., & Oppel Jr, R. (2020). Spit On, Yelled At, Attacked: Chinese-Americans Fear for Their Safety. Retrieved 5 May 2020, from https://www.nytimes.com/2020/03/23/us/chinese-coronavirus-racist-attacks.html